Single-node RL training of Qwen3-8B with GRPO on 8x H100-80GB using Anyscale. Includes Dockerfile, job config, and entrypoint script that handles model download, weight conversion, and async GRPO training with Megatron backend (TP=2, DP=2) and 3 SGLang rollout engines.
- Remove ray job submit, call python directly
- Move env vars to appropriate locations (PYTHONPATH in Dockerfile, CUDA_DEVICE_MAX_CONNECTIONS in job.yaml)
- Simplify entrypoint.sh (remove unused vars, fix paths)
- Add timeout_s to job.yaml
- Restructure README to match other examples pattern
- Rename Dockerfile.anyscale -> Dockerfile
- Change python3 -> python throughout

Signed-off-by: Robert Nishihara <rkn@anyscale.com>

- Replace instance_type with required_resources and required_labels
- Specify H100 accelerator type using ray.io/accelerator-type label
- Define resource requirements: 8 CPUs/32Gi for head, 96 CPUs/512Gi/8 GPUs for workers
- Allows Anyscale to select optimal H100 instance type (e.g., p5.48xlarge)

Signed-off-by: Robert Nishihara <rkn@anyscale.com>

- Update worker resources to match p5.48xlarge specs: 192 vCPUs, 2048Gi memory
- Keeps 8 H100 GPUs with H100 accelerator type label

Signed-off-by: Robert Nishihara <rkn@anyscale.com>

- Add convert_weights_remote.py wrapper with @ray.remote(num_gpus=1)
- Ensures weight conversion runs on GPU worker instead of head node
- Fixes 'No NVIDIA driver' error when running conversion

Signed-off-by: Robert Nishihara <rkn@anyscale.com>

- Create train_remote.py with @ray.remote(num_gpus=4)
- Ensures training runs on GPU workers instead of head node
- Both weight conversion and training now use Ray remote

Signed-off-by: Robert Nishihara <rkn@anyscale.com>
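
The remote-wrapper pattern from the last two commits can be sketched roughly as follows. This is a minimal illustration, not the contents of convert_weights_remote.py or train_remote.py: the helper names (`build_command`, `run_script`, `run_on_gpu_worker`) and the script paths in the usage note are hypothetical. The idea is that a function submitted with a `num_gpus` requirement can only be scheduled on a node that has GPUs, so the subprocess it launches no longer runs on the GPU-less head node.

```python
import subprocess
import sys


def build_command(script, args):
    """Build the argv for running `script` with the current Python interpreter."""
    return [sys.executable, script, *args]


def run_script(script, args):
    """Run the script in a subprocess and return its exit code."""
    return subprocess.run(build_command(script, args), check=False).returncode


def run_on_gpu_worker(script, args, num_gpus=1):
    """Run `script` as a Ray task that requests `num_gpus` GPUs.

    Hypothetical helper: Ray places the task on a node that can satisfy the
    GPU request, which rules out a head node without GPUs.
    """
    import ray  # imported lazily so the helpers above work without Ray installed

    ray.init(ignore_reinit_error=True)
    # ray.remote(...) called with options returns a decorator for the function.
    remote_fn = ray.remote(num_gpus=num_gpus)(run_script)
    return ray.get(remote_fn.remote(script, args))
```

Under these assumptions, weight conversion would be launched with something like `run_on_gpu_worker("convert_weights.py", [], num_gpus=1)` and training with `run_on_gpu_worker("train.py", [], num_gpus=4)`, where both script names are placeholders.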
Summary
Files
- miles_qwen3_8b_h100/Dockerfile.anyscale
- miles_qwen3_8b_h100/job.yaml (m5.2xlarge head + 1x p5.48xlarge worker)
- miles_qwen3_8b_h100/entrypoint.sh
- miles_qwen3_8b_h100/README.md
Cluster Layout
Test plan
anyscale job submit -f job.yaml